TEI exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Mining a corpus of biographical texts using keywords

Internal identifier: 000059 (Main/Exploration); previous: 000058; next: 000060

Authors: Mike Conway [Japan]

Source: Literary and Linguistic Computing, vol. 25, no. 1 (2010-04), pp. 23-35

RBID : ISTEX:7C87E90B31174A63CC37AC19EBEA8261A9D411BB

Abstract

Using statistically derived keywords to characterize texts has become an important research method for digital humanists and corpus linguists in areas such as literary analysis and the exploration of genre difference. Keywords (and the associated concepts of keyness and key-keyness) have inspired conferences and workshops, many and varied research papers, and are central to several modern corpus processing tools. In this article, we present evidence that (at least for the task of biographical sentence classification) frequent words characterize texts better than keywords or key-keywords. Using the naïve Bayes learning algorithm in conjunction with frequency-, keyword-, and key-keyword-based text representation to classify a corpus of biographical sentences, we discovered that the use of frequent words alone provided a classification accuracy better than either the keyword or key-keyword representations at a statistically significant level. This result suggests that (for the biographical sentence classification task at least) frequent words characterize texts better than keywords derived using more computationally intensive methods.
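As an illustration of the frequency-based representation described in the abstract, the following is a minimal sketch, not the author's implementation. It assumes a multinomial naïve Bayes model with add-one smoothing (the abstract does not say which variant was used), whitespace tokenization, and a handful of invented toy sentences; the function names and the labels `bio` and `other` are hypothetical.

```python
import math
from collections import Counter


def tokenize(sentence):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return sentence.lower().split()


def frequent_words(sentences, n):
    """Return the n most frequent words across all sentences."""
    counts = Counter(w for s in sentences for w in tokenize(s))
    return {w for w, _ in counts.most_common(n)}


def train(labelled, vocab):
    """Count sentences per class and in-vocabulary words per class."""
    class_counts = Counter(label for _, label in labelled)
    word_counts = {label: Counter() for label in class_counts}
    for sentence, label in labelled:
        word_counts[label].update(w for w in tokenize(sentence) if w in vocab)
    return class_counts, word_counts


def classify(sentence, vocab, class_counts, word_counts):
    """Pick the class with the highest log posterior (add-one smoothing)."""
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_sentences in class_counts.items():
        score = math.log(n_sentences / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokenize(sentence):
            if w in vocab:
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data, for illustration only (not taken from the article).
TRAIN = [
    ("she was born in 1850 in london", "bio"),
    ("he was educated at oxford and died in 1901", "bio"),
    ("he married in 1875 and was elected in 1880", "bio"),
    ("the weather is cold and wet today", "other"),
    ("prices of grain rose sharply this year", "other"),
    ("the committee will meet again today", "other"),
]

vocab = frequent_words([s for s, _ in TRAIN], n=6)
class_counts, word_counts = train(TRAIN, vocab)
print(sorted(vocab))
print(classify("he was born in 1820", vocab, class_counts, word_counts))
print(classify("the meeting is today", vocab, class_counts, word_counts))
```

In this toy setting the feature set consists only of the six most frequent words, most of them function words. A keyword-based or key-keyword-based representation would replace `frequent_words` with a selection step driven by a keyness statistic, which is not sketched here.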

Url:
DOI: 10.1093/llc/fqp035


Affiliations:



The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title>Mining a corpus of biographical texts using keywords</title>
<author wicri:is="90%">
<name sortKey="Conway, Mike" sort="Conway, Mike" uniqKey="Conway M" first="Mike" last="Conway">Mike Conway</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:7C87E90B31174A63CC37AC19EBEA8261A9D411BB</idno>
<date when="2010" year="2010">2010</date>
<idno type="doi">10.1093/llc/fqp035</idno>
<idno type="url">https://api.istex.fr/document/7C87E90B31174A63CC37AC19EBEA8261A9D411BB/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000385</idno>
<idno type="wicri:Area/Istex/Curation">000385</idno>
<idno type="wicri:Area/Istex/Checkpoint">000030</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Checkpoint">000030</idno>
<idno type="wicri:doubleKey">0268-1145:2010:Conway M:mining:a:corpus</idno>
<idno type="wicri:Area/Main/Merge">000059</idno>
<idno type="wicri:Area/Main/Curation">000059</idno>
<idno type="wicri:Area/Main/Exploration">000059</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a">Mining a corpus of biographical texts using keywords</title>
<author wicri:is="90%">
<name sortKey="Conway, Mike" sort="Conway, Mike" uniqKey="Conway M" first="Mike" last="Conway">Mike Conway</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Japon</country>
<wicri:regionArea>National Institute of Informatics</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">Japon</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="j">Literary and Linguistic Computing</title>
<idno type="ISSN">0268-1145</idno>
<idno type="eISSN">1477-4615</idno>
<imprint>
<publisher>Oxford University Press</publisher>
<date type="published" when="2010-04">2010-04</date>
<biblScope unit="volume">25</biblScope>
<biblScope unit="issue">1</biblScope>
<biblScope unit="page" from="23">23</biblScope>
<biblScope unit="page" to="35">35</biblScope>
</imprint>
<idno type="ISSN">0268-1145</idno>
</series>
<idno type="istex">7C87E90B31174A63CC37AC19EBEA8261A9D411BB</idno>
<idno type="DOI">10.1093/llc/fqp035</idno>
<idno type="ArticleID">fqp035</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0268-1145</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract">Using statistically derived keywords to characterize texts has become an important research method for digital humanists and corpus linguists in areas such as literary analysis and the exploration of genre difference. Keywords (and the associated concepts of keyness and key-keyness) have inspired conferences and workshops, many and varied research papers, and are central to several modern corpus processing tools. In this article, we present evidence that (at least for the task of biographical sentence classification) frequent words characterize texts better than keywords or key-keywords. Using the naïve Bayes learning algorithm in conjunction with frequency-, keyword-, and key-keyword-based text representation to classify a corpus of biographical sentences, we discovered that the use of frequent words alone provided a classification accuracy better than either the keyword or key-keyword representations at a statistically significant level. This result suggests that (for the biographical sentence classification task at least) frequent words characterize texts better than keywords derived using more computationally intensive methods.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Japon</li>
</country>
</list>
<tree>
<country name="Japon">
<noRegion>
<name sortKey="Conway, Mike" sort="Conway, Mike" uniqKey="Conway M" first="Mike" last="Conway">Mike Conway</name>
</noRegion>
<name sortKey="Conway, Mike" sort="Conway, Mike" uniqKey="Conway M" first="Mike" last="Conway">Mike Conway</name>
</country>
</tree>
</affiliations>
</record>
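
The record above can also be read programmatically. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`, working on an abbreviated copy of the record. As displayed, the record uses the `wicri:` prefix without declaring it, so a strict XML parser rejects it; the sketch works around this by wrapping the text in an element that binds the prefix. The namespace URI `urn:example:wicri`, the `wrapper` element and the function names are assumptions made for illustration, not part of Wicri or Dilib.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the record shown above (only a few fields kept).
RECORD = """<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title>Mining a corpus of biographical texts using keywords</title>
</titleStmt>
<publicationStmt>
<idno type="RBID">ISTEX:7C87E90B31174A63CC37AC19EBEA8261A9D411BB</idno>
<idno type="doi">10.1093/llc/fqp035</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<series>
<title level="j">Literary and Linguistic Computing</title>
<imprint>
<biblScope unit="volume">25</biblScope>
<biblScope unit="issue">1</biblScope>
<biblScope unit="page" from="23">23</biblScope>
<biblScope unit="page" to="35">35</biblScope>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
</teiHeader>
</TEI>
</record>"""

# Placeholder namespace URI: an assumption, not an official Wicri value.
WICRI_NS = "urn:example:wicri"


def parse_record(text):
    """Bind the 'wicri' prefix on a wrapper element, then parse."""
    wrapped = f'<wrapper xmlns:wicri="{WICRI_NS}">{text}</wrapper>'
    return ET.fromstring(wrapped).find("record")


def summarize(record):
    """Collect a few bibliographic fields from the parsed record."""
    imprint = record.find(".//imprint")
    return {
        "title": record.findtext(".//titleStmt/title"),
        "doi": record.findtext(".//idno[@type='doi']"),
        "journal": record.findtext(".//title[@level='j']"),
        "volume": imprint.findtext("biblScope[@unit='volume']"),
        "first_page": imprint.find("biblScope[@from]").get("from"),
        "last_page": imprint.find("biblScope[@to]").get("to"),
    }


summary = summarize(parse_record(RECORD))
for field, value in summary.items():
    print(f"{field}: {value}")
```

The same functions should apply to the full record if it is saved to a file and read as text, provided the file contains no other undeclared prefixes.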

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Ticri/explor/TeiVM2/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000059 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000059 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Ticri
   |area=    TeiVM2
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:7C87E90B31174A63CC37AC19EBEA8261A9D411BB
   |texte=   Mining a corpus of biographical texts using keywords
}}

This area was generated with Dilib version V0.6.31.
Data generation: Mon Oct 30 21:59:18 2017. Site generation: Sun Feb 11 23:16:06 2024